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ABSTRACT Microbial activities in soils, such as (incomplete) denitrification, represent major sources of nitrous oxide (N 2 0), a 
potent greenhouse gas. The key enzyme for mitigating N 2 0 emissions is NosZ, which catalyzes N 2 0 reduction to N 2 . We recently 
described "atypical" functional NosZ proteins encoded by both denitrifiers and nondenitrifiers, which were missed in previous 
environmental surveys (R. A. Sanford et al., Proc. Natl. Acad. Sci. U. S. A. 109:19709-19714, 2012, doi: 10. 1073/pnas. 121 1238109). 
Here, we analyzed the abundance and diversity of both nosZ types in whole-genome shotgun metagenomes from sandy and silty 
loam agricultural soils that typify the U.S. Midwest corn belt. First, different search algorithms and parameters for detecting 
nosZ metagenomic reads were evaluated based on in siZico-generated (mock) metagenomes. Using the derived cutoffs, 71 distinct 
alleles (95% amino acid identity level) encoding typical or atypical NosZ proteins were detected in both soil types. Remarkably, 
more than 70% of the total nosZ reads in both soils were classified as atypical, emphasizing that prior surveys underestimated 
nosZ abundance. Approximately 15% of the total nosZ reads were taxonomically related to Anaeromyxobacter, which was the 
most abundant genus encoding atypical NosZ-type proteins in both soil types. Further analyses revealed that atypical nosZ genes 
outnumbered typical nosZ genes in most publicly available soil metagenomes, underscoring their potential role in mediating 
N 2 0 consumption in soils. Therefore, this study provides a bioinformatics strategy to reliably detect target genes in complex 
short-read metagenomes and suggests that the analysis of both typical and atypical nosZ sequences is required to understand and 
predict N 2 0 flux in soils. 

IMPORTANCE Nitrous oxide (N 2 0) is a potent greenhouse gas with ozone layer destruction potential. Microbial activities control 
both the production and the consumption of N 2 0, i.e., its conversion to innocuous dinitrogen gas (N 2 ). Until recently, consump- 
tion of N 2 0 was attributed to bacteria encoding "typical" nitrous oxide reductase (NosZ). However, recent phylogenetic and 
physiological studies have shown that previously uncharacterized, functional, "atypical" NosZ proteins are encoded in genomes 
of diverse bacterial groups. The present study revealed that atypical nosZ genes outnumbered their typical counterparts, high- 
lighting their potential role in N 2 0 consumption in soils and possibly other environments. These findings advance our under- 
standing of the diversity of microbes and functional genes involved in the nitrogen cycle and provide the means (e.g., gene se- 
quences) to study N 2 0 fluxes to the atmosphere and associated climate change. 
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In recent years, anthropogenic emissions of greenhouse gases 
have received increasing attention because of their contribution 
to global warming (1,2). Prominent among these gases is nitrous 
oxide (N z O) (3), which also contributes to ozone depletion (4, 5). 
The anthropogenic fixation of dinitrogen (N 2 ), by means of the 
Haber-Bosch process, has led to the overuse of synthetic nitrogen- 
based fertilizers in agriculture (1, 6). As a consequence of the in- 
creased nitrogen (N) content of soils, atmospheric N 2 0 concen- 
trations have risen about 20% relative to preindustrial-era levels 
(2). N 2 0 emissions are largely the result of bacterial pathways 



controlling the nitrogen cycle. In particular, N 2 0 is generated pri- 
marily as a product of incomplete classic denitrification (i.e., 
N0 3 ~ reduction to N 2 0 via N0 2 ~ and NO) and secondarily as a 
by-product of dissimilatory nitrate reduction to ammonia 
(DNRA) and oxidation of ammonium to nitrite (nitrification) (7, 
8). Besides bacterial activities, abiotic processes and fungal deni- 
trification are thought to be sources of N z O (9, 10). Model predic- 
tions of N 2 0 consumption in terrestrial environments focus pri- 
marily on the N 2 0-to-N 2 reduction step, presently attributed to 
classical denitrifiers possessing nitrous oxide reductase (NosZ) (7). 
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Our previous work has revealed the existence of two phyloge- 
netically distinct NosZ clades, one encompassing typical Z-type 
NosZ proteins, which are commonly found in the Alpha-, Beta-, 
and Gammaproteobacteria, and the other encompassing atypical 
NosZ proteins present in diverse organisms representing different 
phyla. Further analysis of sequenced genomes revealed that most 
of the typical nosZ genes are found in bacteria capable of complete 
denitrification (i.e., encoding all the enzymes for converting 
N0 3 ~/N0 2 _ to N 2 ), whereas atypical nosZ genes are found in 
bacteria with more-diverse N metabolism, including those per- 
forming DNRA and missing the NO-generating nitrite reductase 
genes nirK and nirS (1 1, 12). Notably, atypical NosZ proteins have 
been shown to function as nitrous oxide reductases in several bac- 
teria, such as Wolinella succinogenes (13, 14), Geobacillus thermo- 
denitrificans (15), the soil isolate Anaeromyxobacter dehalogenans 
(11), and several Bacillus species isolated from soils (16, 17). 

Examination of the potential of microbial communities to re- 
duce N 2 0 to N 2 has traditionally been performed by evaluating 
nosZ gene and/or transcript presence or abundance by PCR (18, 
19). Primers targeting nosZ genes, however, were designed ac- 
cording to characterized typical nosZ gene sequences and there- 
fore missed the bulk of divergent atypical genes (11, 12). Further- 
more, measured N 2 0 emissions from soils were frequently lower 
than predictions based on (typical) NosZ transcript abundance 
and dynamics (20, 21). Therefore, it is likely that atypical NosZ 
abundance accounts, at least in part, for the discrepancy between 
predicted and observed N 2 0 fluxes. 

To circumvent the limitations and explore the total natural 
diversity of nosZ genes in the environment, we analyzed short- 
read metagenomic data sets from various soils and locations. Even 
though metagenomics can provide a relatively unbiased, PCR- 
independent view of the diversity and abundance of individual 
genes present in a sample, several technical challenges must first be 
addressed. For instance, in metagenomes of highly diverse micro- 
bial communities, such as those obtained from soils, the rates of 
false positives and false negatives when using similarity searches to 
detect individual genes in assembled contigs or unassembled short 
reads has not been rigorously evaluated, with the probable excep- 
tion of error rates assessed for the purpose of taxonomic classifi- 
cation, i.e., assigning a sequence to a taxon without necessarily 
evaluating its potential function and sequence diversity (22, 23). 
Cutoffs that might minimize the number of false-positive matches 
have not been determined for short-read metagenomes; instead, 
arbitrary, predetermined cutoffs based on E values (i.e., the like- 
lihood of finding a match by chance) represent the common prac- 
tice (24). 

The objective of the present study was to analyze the diversity 
and abundance of both typical and atypical nosZ genes in soils 
with contrasting physicochemical properties. To this end, we first 
developed a strategy based on similarity searches to determine 
appropriate cutoffs for accurately detecting metagenomic nosZ 
fragments by analyzing in silico metagenomes of known sequence 
composition. Subsequently, we applied this strategy and derived 
cutoffs to detect nosZ reads in metagenomes from two agricultural 
soils in the U.S. Midwest that have been subjects of an ongoing 
multiyear study to assess nitrogen cycling processes, as well as in 
publicly available metagenomes from various soil ecosystems. 
Our metagenomic, PCR-independent approach provided a com- 
prehensive and quantitative examination of the diversity and 
abundance of both typical and atypical nosZ genes in soils. 



TABLE 1 Comparison of BLASTn, BLASTx, and HMMER algorithms 
for retrieving typical and atypical nosZ reads from the in silico libraries I 
and II 



Method 


In silico library 


Sensitivity (%) 


Specificity (%) 


BLASTn 


I 


100.0 


99.9 




11 


100.0 


99.9 


BLASTx 


I 


98.4 


99.9 




11 


98.4 


99.9 


hmmsearch (protein) 


1 


48.9 


99.9 




II 


46.0 


99.9 



RESULTS 

Evaluating search algorithms and cutoffs for detecting nosZ 
genes in metagenomes. To determine the best algorithm and pa- 
rameters for detecting nosZ reads in 100-bp-long read meta- 
genomes, a reference database of manually verified nosZ genes that 
preclustered at 95% sequence identity was queried against two 
such in si/ico-generated data sets, libraries I (representing the 
whole genome of 122 NosZ-encoding organisms) and II (repre- 
senting the whole genome of 1,081 bacteria, including those in 
library I). From a receiver operating characteristic (ROC) curve 
analysis of true- and false-positive rates, the most appropriate bit 
score cutoff values were 107 and 52.2 for the BLASTn and BLASTx 
searches, respectively. These bit scores provided sensitivities (the 
fraction of correctly classified positive BLAST matches) of 100% 
and 98.4% for BLASTn and BLASTx, respectively, and a specificity 
(the fraction of correctly classified negative BLAST matches) of 
99.9% for both algorithms. A hidden Markov model (HMM) 
search resulted in a sensitivity of 46% and a specificity of 99.9% 
(Table 1). Additional HMM searches, using models that included 
more typical and atypical NosZ sequences (built from reference 
sequences in Table SI in the supplemental material) improved the 
number of nosZ reads retrieved from both in silico libraries 
(-56%). Irrespective of the HMM model employed, a lower frac- 
tion of nosZ reads was captured when the HMM was used than 
when BLAST searches were performed. Most of the reads missed 
by the HMM-based approach lacked highly conserved amino acid 
residues, and this accounted for the lower performance of HMM 
searches, consistent with expectations (HMM models rely heavily 
on conserved residues). Therefore, remaining analyses were per- 
formed using the BLAST algorithms. 

To test the limitations in retrieving metagenomic nosZ reads, 
single typical and atypical representative NosZ protein or nucleo- 
tide sequences were independently queried against library II. Al- 
though BLASTn had a specificity similar to that of BLASTx, the 
latter algorithm was able to capture 735% and 270% more reads 
(i.e., reads annotated as nosZ with a bit score greater or equal to the 
calculated cutoff for true positives) of the typical and the atypical 
references, respectively. Therefore, BLASTx was used in the re- 
maining analyses. Using typical NosZ from Bradyrhizobium ja- 
ponicum strain USD A 1 10 as a single reference, reads derived from 
74 out of 127 different alleles encoding NosZ were captured and 
found to be enriched in nosZ sequences closely related to the ref- 
erence sequence. In contrast, reads for only 32 out of 127 alleles 
encoding NosZ were captured when atypical NosZ from Anaero- 
myxobacter sp. strain Fw 109-5 was used as a reference in the 
analysis (Fig. 1, right panel). This atypical NosZ reference does not 
exhibit sequence identity to other target sequences in the 54 to 
82% amino acid identity range, and thus, the lack of moderately 
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FIG 1 Fraction of nosZ reads recovered from an in silico data set as a function of their relatedness to the reference query sequence. nosZ reads were retrieved from 
in si/ico-generated library II using Bradyrhizobium japonicum strain USDA 110 to represent typical NosZ (left) or Anaeromyxobacter sp. strain Fw 109-5 to 
represent atypical NosZ (right) reference sequences in a BLASTx search. The average bit score value (y axis) for the fraction of nosZ reads recovered (circle size) 
was plotted against the percentage of identity between each reference sequence and the full-length NosZ sequences from which the retrieved (matching) reads 
originated (target sequence) . The linear relationships observed between the fraction of reads detected and the percentage of identity between the full-length target 
and references sequences were calculated as follows: y = 0.464* + 40.73 {R 2 = 0.90) for typical sequences and y = 0.527x + 42.19 (-R 2 = 0.89) for atypical 
sequences, where y is the percentage of identity between the target and full-length reference sequences and x is the fraction of reads retrieved. 



related sequences among the target sequences accounts for the 
results obtained (Fig. 1, left panel). More importantly, a linear 
relationship was observed between the fraction of total reads de- 
tected and the level of divergence between the full-length refer- 
ence and target sequences (Fig. 1 ) , where 50% or more of the reads 
were detected when the two sequences shared more than 64 or 
68% sequence identity for typical or atypical NosZ reference se- 
quences, respectively. 

Further examination of the reads recruited along the typical 
NosZ reference (Fig. 2) showed that the true-positive matches 
(i.e., reads derived from nosZ genes with a bit score greater than 
52.2) were evenly distributed along the NosZ reference sequence. 
The N terminus of the reference sequence (1 to 60 amino acid 
positions) was rarely covered by either true- or false-positive 
matches, suggesting that this part of the gene should be avoided 
when assessing nosZ abundance in metagenomes. 

Abundance of nosZ genes in sandy and silty soils. A charac- 
terization of the taxa, coverage, and assembly statistics of the two 
soil metagenomes is described in the supplemental material. Both 
soil metagenomes were queried against a 95%-identical preclus- 
tered set of reference NosZ sequences. All matches having a bit 
score greater than the calculated cutoff determined based on the in 
silico library analysis were identified as nosZ reads and classified as 



typical or atypical depending on their best match. Atypical nosZ 
reads were clearly the most abundant, comprising 72.9% and 
89.6% of the total nosZ reads found in the sand and silt loam soil 
metagenomes, respectively (Fig. 3). Further, 97% of the nosZ reads 
found in both soil metagenomes (4,929 and 7,280, total, in the 
sandy and silty loam soils, respectively) were recruited by 72 of the 
105 NosZ reference sequences, revealing that most of the diversity 
covered by the references was represented in both soils (Table S2). 
In addition, both soil samples showed similar estimated absolute 
abundances for nosZ reads: -1.4 X 10~ 5 and 2.1 X 10~ 5 reads of 
all reads for the Havana sand and Urbana silt loam, respectively 
(Fig. 3). The ratio of nosZ reads to single-copy housekeeping gene 
reads indicated that approximately 16% of the soil bacterial ge- 
nomes harbored a nosZ gene (Table S3). 

Phylogenetic analysis of the atypical nosZ reads showed that 
closely related genes found in members of the Anaeromyxobacter, 
Gemmatimonas, Opitutus, and Hydrogenobacter genera were the 
most abundant in both soil samples (Fig. 4). Additionally, less 
abundant genera, such as Bradyrhizobium and Rhodopseudomo- 
nas, both known to harbor typical nosZ genes, were also present in 
both soils. Remarkably, atypical nosZ reads affiliated with Anaero- 
myxobacter represented 12.7% and 15.2% of the total nosZ reads 
found in both sand and silt loam soils, respectively. The most 
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FIG 2 Coverage of matching nosZ reads from library II along the Bradyrhi- 
zobium japonicum NosZ reference sequence. Reads from library II matching 
the Bradyrhizobium japonicum strain USDA 110 NosZ reference are plotted 
according to their bit score values. Blue and green lines represent reads origi- 
nating from nosZ genes and other genes, respectively. The solid gray line rep- 
resents the average percentage of identity of nosZ reads (blue lines) matching 
the NosZ reference in a 3-amino-acid window. 



abundant typical nosZ reads were assigned to the Ralstonia (3.2%), 
Thiobacillus (3.1%), Bradyrhizobium (1.6%), and Rhodopseudo- 
monas (1.7%) genera (Table S2). 

nosZ diversity and abundance in other soil metagenomes. In 
general, atypical nosZ reads were more abundant than other reads 
in the soil metagenomes evaluated (Fig. 5). The frozen deep-soil 
permafrost metagenomes (core 1 sample in reference 25) showed 
a greater abundance for typical nosZ reads (-80% of total NosZ); 
however, atypical nosZ reads predominated in the upper or active 
layer (-74% of total nosZ sequences). Interestingly, after induced 
thawing, the microbial communities at both depths showed a 
small increase in the relative abundance of atypical nosZ reads. 
Furthermore, except with the boreal forest biome, several biomes 
studied by Fierer and colleagues (26), including tropical forest, 
polar and hot desert, arctic tundra, and temperate grassland, 
showed higher abundances of atypical than of typical nosZ reads. 

DISCUSSION 

Importance of atypical NosZ. The discovery of functional atypi- 
cal NosZ proteins has opened the possibility that a much larger 
number of microorganisms with previously unaccounted N 2 0- 
reducing potential contribute to lessening the N 2 0 emissions to 
the atmosphere than previously expected (12). The abundance 
and diversity of atypical nosZ genes were likely missed in previous 
PCR-based surveys because typical nosZ sequences presented the 
basis for primer design (11, 12) and the two nosZ types share only 
60.9% ± 8.2% nucleotide identity, on average. In the present 
PCR-independent metagenome analysis, atypical NosZ sequences 
were more abundant ( > 73% of total nosZ reads) than their typical 
counterparts, not only in two agricultural soils differing in physi- 




FIG 3 Relative abundances for typical and atypical nosZ genes in Havana sand 
and Urbana silt loam soil metagenomes. All metagenomic nosZ reads from soil 
were classified as typical or atypical according to their best match against a 
reference database of NosZ sequences, and the calculated relative abundances 
of the two gene types are shown (primary y axis, bars). The absolute abun- 
dance, i.e., number of identified nosZ reads per million of all reads in each 
metagenome is also shown (secondary y axis, dots). 



cochemical properties representative of many regions in the Mid- 
west United States but also in soils from distant geographic loca- 
tions representing a variety of habitats. Our results were also 
consistent with the widespread presence of atypical nosZ genes, 
previously hypothesized based on the number of genomes found 
to encode atypical NosZ proteins among the available genome 
sequences (11). Therefore, these findings reveal an unexpectedly 
high potential for N z O reduction mediated by atypical NosZ in a 
variety of soil habitats. 

It is important to note that our study, being solely based on 
DNA sequences, evaluated N z O reduction potential as opposed to 
the specific in situ activity of NosZ enzymes, typical or atypical. 
Since negative (purifying) selection efficiently removes unused 
genes from genomes in microbial populations, the high abun- 
dance of atypical nosZ sequences found in different soil samples 
underpins their functional and/or ecological potential (e.g., 
Fig. 5). Given also that N 2 0 reduction is the only known biochem- 
ical function carried by NosZ ( 1 1 ) , our results collectively suggest 
that atypical NosZ proteins are as important, if not more impor- 
tant, than their typical counterparts in controlling N 2 0 fluxes in 
soils and likely other environments. Our study also provided the 
means (e.g., gene sequences for primer design and a bioinformat- 
ics strategy) to facilitate future studies of the effect of environmen- 
tal conditions on NosZ activity and dynamics toward a more pre- 
dictive understanding of the nitrogen cycle. 

The most abundant nosZ genes in the agricultural soils studied 
here are affiliated with the Anaeromyxobacter genus (Fig. 4). Mem- 
bers of this genus are widely distributed in soils with different 
physical and chemical characteristics as well as soils from a variety 
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FIG 4 Phylogenetic affiliations for the five most abundant genera harboring typical and atypical nosZ genes in Havana sand and Urbana silt loam soil 
metagenomes. Metagenomic reads were assigned to a genus based on their best match against a reference database of NosZ sequences, and the normalized 
number of reads (based on the size of the data sets) assigned to each genus is shown for Havana sand (A) and Urbana silt loam (B) soils. Red and blue bars 
represent atypical and typical genes, respectively. 



of geographic locations (11, 27, 28). The high abundance of nosZ 
genes affiliated with members of the nondenitrifying Anaero- 
myxobacter genus is consistent with recent PCR surveys that em- 
ployed primers targeting atypical nosZ sequences (11, 12) and 
A. dehalogenans 16S rRNA gene sequences (11). Further, a high 
phylogenetic congruence between typical nosZ and 16S rRNA 
gene phylogenies was previously reported (29, 30). Therefore, 
abundant atypical nosZ metagenomic sequences (Fig. 4 and Ta- 
ble S2) that have distant matches to homologs of known NosZ- 



encoding taxa may be harbored by novel taxa. The sequences re- 
ported here should facilitate the identification of new taxa, 
expanding our understanding of phylogenetic diversity in NosZ- 
encoding soil organisms. The majority of the abundant atypical 
nosZ reads that were assignable to known taxa were found in po- 
tentially nondenitrifying genomes of genera such as Anaeromyxo- 
bacter, Ignavibacterium, Opitutus, Dyadobacter, and Gemmatimo- 
nas, which were overlooked in previous PCR surveys targeting 
typical nosZ genes. Therefore, the inclusion of these unaccounted 
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FIG 5 Relative abundances of typical and atypical nosZ sequences in various soil ecosystems. nosZ reads were retrieved from available Illumina short-read 
metagenomes by following the same approach used with the Illinois soils reported in this study. The bars represent the probability of finding typical (blue) or 
atypical (red) sequences as a proxy for their relative abundances, and the error bars represent the variance of the sample mean for each soil metagenome. Data 
sets were obtained from Mackelprang et al. (25) (library I), Fierer et al. (26) (library II), and Luo et al. (40) (library III). 
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N 2 0 reducers in future environmental studies may help bridge the 
gap between measured N z O emissions and denitrification poten- 
tial based solely on typical nosZ gene or nosZ transcript measure- 
ments. 

Our results for the Illinois soils are based on a composite sam- 
ple comprised of equal DNA quantities extracted from multiple 
subsamples, intended to capture spatial heterogeneity at each field 
site (at both centimeter depths and meter landscape scales). Since 
the soils were taken at a single time point, it is likely that some of 
the trends reported here for these soils (e.g., abundance of specific 
NosZ-encoding taxa, average coverage of metagenomes, etc.) 
might differ temporally, given that agricultural soils typically re- 
ceive seasonal management inputs. Nonetheless, the high abun- 
dance of atypical nosZ sequences found in these agricultural soil 
metagenomes and in soils from different locations and of diverse 
physicochemical characteristics (e.g., Fig. 5) emphasize their po- 
tential importance for nitrogen cycling. 

A bioinformatics methodology to detect target genes. Our 
evaluation of in siZzco-generated data sets of known species and 
gene composition showed that both BLASTn and BLASTx algo- 
rithms represent reliable means of detecting reads encoding nosZ 
(or other target) genes, albeit with their own strengths and limi- 
tations. The selection of the most appropriate algorithm should 
consider the computational resources available. For example, 
BLASTx is more computationally demanding than BLASTn but 
can capture more-divergent sequences if more distantly related 
sequences/organisms are targeted. However, BLASTn similarity 
searches are less affected by frameshift-introducing sequencing 
errors, which might be significant in short-read data even after 
stringent quality read trimming. Frameshift correction tools such 
as FrameBot (31), HMM-Frame (32), and FragGeneScan (33) are 
available to correct these sequencing artifacts and also predict 
protein-coding regions in short reads. 

The low performance of the profile-based approach (HMM) 
versus that of BLASTx (Table 1) is presumably attributable to the 
lower sensitivity of the former with (i) short sequences, (ii) se- 
quencing errors creating frameshifts, and (iii) a reduced fraction 
of highly conserved amino acid residues specific to the protein of 
interest, as suggested previously (32, 34). Regardless of these lim- 
itations, HMM-based searches are preferable when targeting dis- 
tantly related homologs and using full-length sequences (e.g., tar- 
geting complete gene sequences recovered in assembled contigs). 

Recommendations for the study of other genes. The afore- 
mentioned approach based on in silico metagenomes, the BLAST 
algorithm, and ROC analysis can be modified for other functional 
genes of interest. Special attention should be given to conserved 
domains in the target gene or protein that are shared with other 
nontarget genes/proteins. As shown in Fig. 2, no false-positive 
matches were observed for the B. japonicum strain USD A 110 
NosZ reference sequence for bit score values above the calculated 
cutoff. The latter finding indicates that no high-identity domains 
or motifs are shared with other non-NosZ sequences. Other genes 
may deviate from this pattern, and a case-by-case evaluation (e.g., 
Fig. SI) is recommended. Our approach, when modified to use 
sliding windows along the sequence of the reference gene, can also 
determine appropriate cutoffs for different regions of the se- 
quence and identify regions that represent reliable targets for fur- 
ther analyses (e.g., low abundance of false-positive matches and 
PCR primer design). 

In silico data sets simulating different error rates, insert sizes, 



and coverage can easily be constructed to mimic different short- 
read sequencing technologies or methodologies. Nonetheless, 
simulating the diversity and variable abundances of individual 
taxa of real soil metagenomes remains challenging (e.g., our in 
silico data sets had substantially less diversity than the real metag- 
enomes used in the study). The expansion of in silico library I by 
13.6 million reads from 959 sequenced genomes not containing a 
nosZ gene (i.e., in silico library II) did not increase the number of 
false-positive matches obtained for nosZ (Table 1 ) , suggesting that 
a small number of, if any, false-positive matches should be ex- 
pected for real metagenomes. In addition, having a comprehen- 
sive and well-curated set of protein or gene reference sequences is 
a key requirement for robust assessment of the best cutoffs and 
parameters to effectively retrieve reads encoding the gene(s) of 
interest. 

In conclusion, we developed a bioinformatics approach for the 
detection of target genes in short-read metagenomes. This meth- 
odology can be extended to the study of any other genes or pro- 
teins of interest. The high abundance of the previously unac- 
counted atypical nosZ genes in the soil samples suggests that 
nondenitrifiers and denitrifiers that harbor atypical nosZ genes 
may contribute more than previously thought to the reduction of 
N 2 0 to innocuous N 2 gas. 

MATERIALS AND METHODS 

Samples, DNA extraction, and sequencing. In November 2011, agricul- 
tural soil samples were collected from two sites with long histories of 
commercial corn and soybean production in the U.S. Midwest corn belt: 
(i) Havana, IL (93% sand; 7% clay; lat 40.296, long -89.944; elevation, 
150 m), and (ii) Urbana, IL (21% sand; 69% silt; 10% clay; lat 40.075, long 
— 88.242; elevation, 222 m). In order to provide a metagenome represen- 
tative of the total soil profile and minimize the effect of sample heteroge- 
neity, soil was collected as three replicate cores (2.5 cm by 30 cm) taken at 
three locations 30 m apart within each field plot (9 cores, total, per field), 
with each core partitioned into four depths (0 to 5 cm, 5 to 10 cm, 10 to 
20 cm, and 20 to 30 cm). Soil physicochemical characteristics were mea- 
sured at each depth (A&L Laboratories, Ft. Wayne, IN) (Table S4). DNA 
was extracted from -0.5 g of soil from each fraction according to a previ- 
ously described phenol-chloroform extraction and purification protocol 
(35), and approximately equal quantities of DNA from each fraction 
based on agarose gel quantification were pooled to create one composite 
sample for each soil type. The Illumina TruSeq and Nextera DNA library 
preparation protocols were used for the Havana sand and Urbana silt 
loam samples, respectively. Sequencing of composite DNA samples was 
performed using the Illumina HiSeq 2000 platform, resulting in 38.4 and 
40.2 Gbp of 100-bp long paired-end reads for the Havana sand and Ur- 
bana silt loam samples, respectively. 

Sequence processing. An in-house Python script (available at http:// 
enve-omics.gatech.edu) was used for quality trimming of raw Illumina 
reads as described previously (36) . In brief, this script trims from both the 
5' and the 3' end of a sequence using an average Phred score threshold of 
20 in 3-bp-long windows and discards resulting sequences shorter than 
50 bp (Table S5). The same trimming strategy was applied to publicly 
available metagenomes for consistency. All BLAST+ (37) and HMMER 
(38) analyses were based on both single reads, when the corresponding 
sister read was not available or discarded after the trimming step, and 
pair-end reads. 

In silico libraries and cutoff calculation. An in-house Python script 
was used to generate in silico libraries from available complete genomes in 
the NCBI database as of April 2013 (2,355 sequenced genomes) as de- 
scribed previously (36). Briefly, this script simulates an Illumina run by 
generating 100-bp paired-end reads with a sequencing error of 0.5%, an 
insert size of 500 bp, and a user-defined coverage of 3-fold. The script also 
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reports the coordinates of the genome from which the reads were gener- 
ated, so that gene encoding information for each read is available (based 
on NCBI protein table files or ptt files). The first in si/ko-generated data set 
(library I) was built based on seven bacterial plasmids and 115 chromo- 
somes previously determined to bear a nosZ gene ( 12). In silico library II 
was constructed using the 122 DNA sequences from library I and an ad- 
ditional 959 sequenced chromosomes that did not encode NosZ (con- 
firmed independently by BLASTp analysis ) . Libraries I and II had a total of 
7,460 NosZ-encoding reads and were -14 and -136 million reads in size, 
respectively. 

BLAST analyses were performed using the BLAST+ 2.2.7 release with 
the following settings: a word size of 7, a penalty of — 2, no dust, an E value 
cutoff of 0.001, and an x-drop gap of 150. BLASTx settings were no SEG 
program, an E value cutoff of 0.001, and a word size of 3. Previously 
described nosZ nucleotide or protein references from complete genomes 
were clustered at 95% sequence identity, and the longest representative 
sequence from each cluster was used to construct a reference database, 
consisting of 54 typical, 47 atypical, and 4 halophilic archaeal representa- 
tive reference sequences. All reads originating from a nosZ gene, whether 
located on a chromosome or a plasmid, that matched a nucleotide or 
protein sequence reference were classified as true positives. Reads not 
originating from a nosZ reference sequence that matched a reference se- 
quence were classified as false positives. The numbers of true and false 
positives obtained from each algorithm (performance) were evaluated by 
the receiver operating characteristic (ROC) curve using the R pROC li- 
brary (39). The bit score cutoff that maximized performance was calcu- 
lated as the line that maximized the distance to the identity line (i.e., the 
nondiscriminatory diagonal line where sensitivity and specificity are 
equal) according to the Youden method for a partial area under the curve 
(pAUC) between 90% and 100% of specificity (39) (see Fig. SI for a flow 
chart of the approach). 

A hidden Markov model (HMM) based on the sequences of six func- 
tionally characterized NosZ proteins (Bradyrhizobium japonicum USDA 
110 27375426, WolineUa succinogenes 46934822, Paracoccus denitrificans 
2833444, Achromobacter cycloclastes 37538302, and Anaeromyxobacter de- 
halogenans 2CP-C 86158824) was built with HMMER 3.0 and used to 
query translated reads from libraries I and II for nosZ matches based on 
the hmmsearch algorithm (38) (Table 1). ROC analyses were not per- 
formed for HMM searches due to the high sensitivity and specificity ob- 
tained after each search. 

Detecting nosZ reads in metagenomes. Publicly available metag- 
enomes from Alaskan permafrost (25), soilbiomes (26), and soils exposed 
to a decade of warming (40) were downloaded from the DOE Joint Ge- 
nome Institute (http://www.jgi.doe.gov), MG-RAST (http:// 
metagenomics.anl.gov), and NCBI Sequence Read Archive Web servers, 
respectively. NosZ-encoding reads (or nosZ reads for simplicity) in the 
above-named metagenomes were identified by BLAST searches against 
the NosZ reference sequences clustered by 95% identity (Table SI) and 
classified as typical or atypical NosZ sequences based on their best match. 
To account for differences in the numbers of sequences among the pub- 
licly available metagenomes, the presence or absence of each type of nosZ 
read was represented as a binomial distribution for each metagenome. 
Assuming independence in the presence or absence of each type of nosZ 
gene in each soil sample, the probability of finding either type was calcu- 
lated from the frequency of nosZ reads detected in each metagenome (i.e., 
a probability closer to 1 implies a higher abundance for the corresponding 
type of nosZ gene in the metagenome). To account for differences in the 
numbers of reads for each metagenome, the standard deviation of the 
sample mean was calculated for each distribution. 

Fractions of genomes containing a nosZ gene. To estimate the frac- 
tions of the microbial populations in the soil community with nosZ genes 
in their genomes, the following approach was used. Sequences of three 
single-copy housekeeping genes (dnaK, recA, and rpoB) were used as ref- 
erences to query each metagenome. The reference set for each housekeep- 
ing gene included sequences from 30 different bacterial species (denoted 



by an asterisk in Table SI ) that also contained a typical or an atypical nosZ 
gene (i.e., half of the species in the set harbored a typical nosZ and the 
other half an atypical gene). The total number of matches obtained in a 
BLASTn search (settings were as follows: no dust, a word size of 7, a 
penalty of — 2, a maximum number of target sequences of 1, an x-drop of 
150, and anE value ofO.OOl) for each set of housekeeping and nosZ genes 
was normalized by the average length of the query (reference) sequences. 
The fraction of the microbial community harboring nosZ genes was cal- 
culated as the ratio of the normalized number of nosZ reads to the number 
of reads assigned to each of the housekeeping genes (assuming one nosZ 
gene copy per genome, which is the case for >97% of the analyzed ge- 
nomes in Table SI; see also Table S3). 

Both the Havana sand and Urbana silt loam metagenomes are avail- 
able under accession numbers SRR1 152189 and SRR1 153387 in the Se- 
quence Read Archive server. 
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